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Abstract — A Partition  Aware  Engine  framework 
provides  a major  increase  in  compression  with  respect  to 
all  currently  known  techniques,  both  on  web  graphs  and 
on  social  networks.  These  improvements  make  it  possible 
to  analyse  in  main  memory  significantly  larger  graphs. 
Graph  partition  quality  affects  the  overall  performance  of 
parallel  graph  computation  systems.  The  quality  of  a 
graph  partition  is  measured  by  the  balance  factor  and 
edge  cut  ratio.  A balanced  graph  partition  with  small 
edge  cut  ratio  is  generally  preferred  since  it  reduces  the 
expensive  network  communication  cost.  However, 
according  to  an  empirical  study  on  Giraph,  the 
performance  over  well  partitioned  graph  might  be  even 
two  times  worse  than  simple  random  partitions.  This  is 
because  these  systems  only  optimize  for  the  simple 
partition  strategies  and  cannot  efficiently  handle  the 
increasing  workload  of  local  message  processing  when  a 
high  quality  graph  partition  is  used.  In  this  paper,  we 
propose  a novel  partition  aware  graph  computation 
engine  named  PAGE,  which  equips  a new  message 
processor  and  a dynamic  concurrency  control  model.  In 
this  paper,  we  propose  a new  paradigm,  Partition  Aware 
Engine  framework  that  allows  parametric  control  of 
asynchrony  ranging  from  completely  asynchronous 
execution  to  partially  asynchronous  execution  to  level- 
synchronous  execution.  Partial  asynchrony  is  achieved  by 
generalizing  the  BSP  model  to  allow  each  super  step  to 
process  up  to  k levels  of  the  algorithm  asynchronously.  In 
the  model,  we  studying  on  two  heuristic  rules  to 
effectively  extract  the  system  characters  and  generate 
proper  parameters. 
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I.  INTRODUCTION 

Database  Parallel  graph  algorithms  are  currently 
expressed  in  level  synchronous  or  asynchronous 
paradigms.  Level-synchronous  paradigms  iteratively 
process  vertices  of  a graph  level  by  level.  This  model 
guarantees  the  current  level’s  computation  to  have 
completed  before  starting  the  next  one  through  the  use  of 
global  synchronizations  at  the  end  of  each  level.  Level- 
synchronous  algorithms  tend  to  perform  well  when  the 
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number  of  levels  is  small,  but  suffer  from  poor  scalability 
when  the  number  of  levels  is  large.  Bulk  synchronous 
parallel  (BSP)  algorithms  can  naturally  be  expressed  in 
this  paradigm.  The  asynchronous  paradigm  replaces 
global  synchronizations  with  point-to-point 
synchronizations,  which  can  increase  the  degree  of 
parallelism,  but  which  may  also  require  the  completion  of 
redundant  work.  For  example,  an  asynchronous  breadth- 
first  search  (BFS)  may  re-visit  vertices  multiple  times  as 
shorter  paths  are  discovered.  Choosing  the  right  paradigm 
depends  on  the  system,  input  graph,  and  algorithm.  This 
implies  different  implementations  and  optimizations  for 
algorithms,  with  no  easy  way  to  switch  between  them. 

II.  SURVEY  REVIEW 

1.  Harshvardhan  show  how  common  patterns  in  graph 
algorithms  can  be  expressed  in  the  KLA  paradigm 
and  provide  techniques  for  determining  k,  the  number 
of  asynchronous  steps  allowed  between  global 
synchronizations.  Results  of  an  implementation  of 
KLA  in  the  staple  Graph  Library  show  excellent 
scalability  on  up  to  96K  cores  and  improvements  of 
lOx  or  more  over  level  synchronous  and 
asynchronous  versions  for  graph  algorithms  such  as 
breadth-first  search,  PageRank,  k-core  decomposition 
and  others  on  certain  classes  of  real-world  graphs. 

2.  Amr  Ahmed  proposes  a framework  for  large-scale 
graph  decomposition  and  inference.  To  resolve  the 
scale,  our  framework  is  distributed  so  that  the  data 
are  partitioned  over  a shared  nothing  set  of  machines. 
They  propose  a novel  factorization  technique  that 
relies  on  partitioning  a graph  so  as  to  minimize  the 
number  of  neighbouring  vertices  rather  than  edges 
across  partitions.  System  decomposition  is  based  on  a 
streaming  algorithm.  It  is  network-aware  as  it  adapts 
to  the  network  topology  of  the  underlying 
computational  hardware. 

3.  Nathan  Backman  present  a framework  that 
parallelizes  and  schedules  workflows  of  stream 
operators,  in  real-time,  to  meet  latency  objectives.  It 
supports  data-  and  task-parallel  processing  of  all 
workflow  operators,  by  all  computing  nodes,  while 
maintaining  the  ordering  properties  of  sorted  data 
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streams.  System  show  that  a latency-oriented 
operator  scheduling  policy  coupled  with  the 
diversification  of  computing  node  responsibilities 
encourages  parallelism  models  that  achieve  end-to- 
end  latency-minimization  goals. 

4.  Lars  Backstrom  use  decision-tree  techniques  to 
identify  the  most  significant  structural  determinants 
of  these  properties.  Also  develop  a novel 
methodology  for  measuring  movement  of  individuals 
between  communities,  and  show  how  such 
movements  are  closely  aligned  with  changes  in  the 
topics  of  interest  within  the  communities. 

5.  Paolo  Boldi  presents  the  compression  techniques 
used  in  Web  Graph,  which  are  centred  around 
referentiation  and  intervalisation  (which  in  turn  are 
dual  to  each  other).  Web  Graph  can  compress  the 
WebBase  graph  (118  Mnodes,  1 Glinks)  in  as  little  as 
3.08  bits  per  link,  and  its  transposed  version  in  as 
little  as  2.89  bits  per  link. 

6.  Marco  Rosa  proposed  algorithm  with  the  Web  Graph 
compression  framework  provides  a major  increase  in 
compression  with  respect  to  all  currently  known 
techniques,  both  on  web  graphs  and  on  social 
networks.  These  improvements  make  it  possible  to 
analyse  in  main  memory  significantly  larger  graphs. 

III.  PARALELL  IMPLEMENTATION 

Layered  label  propagation  lends  itself  naturally  to  the 
task-decomposition  parallel-programming  paradigm, 
which  may  dramatically  improve  performances  on 
modern  multicore  architectures:  since  the  update  order  is 
randomised,  there  is  no  obstacle  in  updating  several  nodes 
in  parallel.  Our  implementation  breaks  the  set  of  nodes 
into  a very  small  number  of  tasks  (in  the  order  of 
thousands).  A large  number  of  threads  picks  up  the  first 
available  task  and  solves  it:  as  a result,  we  obtain  a 
performance  improvement  that  is  linear  in  the  number  of 
cores.  We  are  helped  by  Web  Graph’s  facility,  which 
allows  us  to  provide  each  thread  with  a lightweight  copy 
of  the  graph  that  shares  the  bit  stream  and  associated 
information  with  all  other  threads. 

IY.  CHALLENGES 

Latent  variable  modelling  is  a promising  technique  for 
many  analytics  and  predictive  inference  applications. 
However,  parallelization  of  such  models  is  difficult  since 
many  latent  variable  models  require  frequent 

synchronization  of  their  state.  The  power  law  nature  of 
such  graphs  makes  it  difficult  to  use  chromatic 
scheduling.  Furthermore,  the  bulk-synchronous 

processing  paradigm  of  Map-Reduce  does  not  afford  low- 
enough  latency  for  fast  convergence:  this  has  been 
reported,  e.g.  in  comparisons  between  bulk-synchronous 
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convergence  and  asynchronous  convergence. 
Consequently  there  is  a considerable  need  for  algorithms 
which  address  the  following  issues  when  performing 
inference  on  large  natural  graphs: 

Graph  Partitioning  We  need  to  find  a communication 
efficient  partitioning  of  the  graph  in  such  a manner  as  to 
ensure  that  the  number  of  neighbouring  vertices  rather 
than  the  number  of  edges  is  minimized.  This  is  relevant 
since  latent  variable  models  and  their  inference 
algorithms  store  and  exchange  parameters  that  are 
associated  with  vertices  rather  than  edges. 

Network  Topology  In  many  graph-based  applications 
the  cost  of  communication  (and  to  some  extent  also 
computation)  the  cost  of  storing  data.  Hence  it  is  desirable 
to  have  an  algorithm  which  is  capable  to  layout  data  in  a 
network-friendly  fashion  on  the  fly  once  we  know  the 
computational  resources. 

Variable  Replication  While  the  problem  of  variable 
synchronization  for  statistical  inference  with  regular 
structure  is  by  now  well  understood,  the  problem  for 
graphs  is  more  complex:  The  state  space  is  much  larger 
(each  vertex  holds  parts  of  a state),  rendering 
synchronization  much  more  costly  - unlike  in  aspect 
models  only  few  variables  are  global  for  all  partitions. 
Asynchronous  Communication  Finally  there  is  the 
problem  of  eliminating  the  synchronization  step  in 
traditional  bulk-synchronous  systems  on  graphs.  More 
specifically,  uneven  load  distribution  can  lead  to 
considerable  inefficiencies  in  the  bulk  synchronous 
setting.  After  all,  it  is  the  slowest  machine  that  determines 
the  runtime  of  each  processing  round  (e.g.  Map  Reduce). 
Asynchronous  schemes,  on  the  other  hand,  are  nontrivial 
to  implement  as  they  often  require  elaborate  locking  and 
scheduling  strategies. 

V.  DYNAMIC  CONCURRENCY  CONTROL 
MODEL 

The  concurrency  control  problem  can  be  modelled  as  a 
typical  producer-consumer  scheduling  problem,  where  the 
computation  phase  generates  messages  as  a producer,  and 
message  process  units  in  the  dual  concurrent  message 
processor  are  the  consumers.  Therefore,  the  producer- 
consumer  constraints  should  be  satisfied  when  solving  the 
concurrency  control  problem. 

The  concurrency  of  dual  concurrent  message  processor 
heavily  affects  the  performance.  But  it  is  expensive  and 
also  challenging  to  determine  a reasonable  concurrency 
ahead  of  real  execution  without  any  assumption. 
Therefore,  PAGE  needs  a mechanism  to  adaptively  tune 
the  concurrency  of  the  dual  concurrent  message 
processor.  The  mechanism  is  named  Dynamic 
Concurrency  Control  Model,  DCCM  for  short. 
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For  the  PAGE  situation,  the  concurrency  control  problem 
arises  consumer  constraints.  Since  the  behaviour  of 
producers  is  determined  by  the  graph  algorithms,  PAGE 
only  requires  to  adjust  the  consumers  to  satisfy  the 
constraints  (behaviour  of  graph  algorithms),  which  are 
stated  as  follows. 

First,  PAGE  provides  sufficient  message  process  units  to 
make  sure  that  new  incoming  message  blocks  can  be 
processed  immediately  and  do  not  block  the  whole 
system.  Meanwhile,  no  message  process  unit  is  idle. 
Second,  the  assignment  strategy  of  these  message  process 
units  ensures  that  each  local/remote  message  process  unit 
has  balanced  workload  since  the  disparity  can  seriously 
destroy  the  overall  performance  of  parallel  processing. 

VI.  CONCLUSION 

Finally  we  conclude  that  our  study  of  the  partition 
unaware  problem  in  current  graph  computation  systems 
and  its  severe  drawbacks  for  efficient  parallel  large  scale 
graphs  processing.  To  address  this  problem,  we  proposed 
a partition  aware  graph  computation  engine  named 
Parallel  Graph  Computation  using  a Partition  Aware 
Engine  that  monitors  three  high-level  key  running  metrics 
and  dynamically  adjusts  the  system  configurations. 
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